Abstract

The retrieval problem is the problem of associating data with keys in a set. Formally, the
data structure must store a function $f : U \to \{0,1\}^r$ that has specified values on the elements
of a given set $S \subseteq U$, $|S| = n$, but may have any value on elements outside $S$. Minimal perfect
hashing makes it possible to avoid storing the set $S$, but this induces a space overhead of
$O(n)$ bits in addition to the $nr$ bits needed for function values. In this paper we show how to
eliminate this overhead. Moreover, we show that for any $k$ query time $O(k)$ can be achieved
using space that is within a factor $1 + e^{-k}$ of optimal, asymptotically for large $n$. If we allow
logarithmic evaluation time, the additive overhead can be reduced to $O(\log\log n)$ bits whp. The
time to construct the data structure is O(n), expected. A main technical ingredient is to utilize 
existing tight bounds on the probability of almost square random matrices with rows of low 
weight to have full row rank. In addition to direct constructions, we point out a close connection 
between retrieval structures and hash tables where keys are stored in an array and some kind 
of probing scheme is used. Further, we propose a general reduction that transfers the results on 
retrieval into analogous results on approximate membership, a problem traditionally addressed 
using Bloom filters. Again, we show how to eliminate the space overhead present in previously 
known methods, and get arbitrarily close to the lower bound. The evaluation procedures of our 
data structures are extremely simple (similar to a Bloom filter). For the results stated above 
we assume free access to fully random hash functions. However, we show how to justify this 
assumption using extra space o(n) to simulate full randomness on a RAM. 
1 Introduction



Suppose we want to build a data structure that is able to distinguish between girls' and boys' 
names, in a collection of n names. Given a string not in the set of names, the data structure may 
return any answer. It is clear that in the worst case this data structure needs at least n bits, even 
if it is given access to the list of names. The previously best solution (implicit in [22]) that does 
not require the set of names to be stored requires around 1.23n bits. Surprisingly, as we will see 
in this paper, $n + o(n)$ bits is enough, still allowing fast queries. If "global" hash functions, shared
among all data structures, are available, the space usage drops all the way to $n + O(\log\log n)$ bits
whp. This is a rare example of a data structure with non-trivial functionality and a space usage
that essentially matches the entropy lower bound. 

1.1 Problem definition 

The dictionary problem consists of storing a set S of n keys, and r bits of data associated with each 
key. A lookup query for $x$ reports whether or not $x \in S$, and in the positive case reports the data
associated with $x$. We will denote the size of $S$ by $n$, and assume that keys come from a set $U$ of
size $n^{O(1)}$. In this paper, we restrict ourselves to the static problem, where $S$ and the associated
data are fixed and do not change. We study two relaxations of the static dictionary problem that 
allow data structures using less space than a full-fledged dictionary: 

• The retrieval problem differs from the dictionary problem in that the set $S$ does not need to
be stored. A retrieval query on $x \in S$ is required to report the data associated with $x$, while
a retrieval query on $x \notin S$ may return any $r$-bit string.

• The approximate membership problem consists of storing a data structure that supports
membership queries in the following manner: For a query on $x \in S$ it is reported that $x \in S$. For a
query on $x \notin S$ it is reported with probability at least $1 - \varepsilon$ that $x \notin S$, and with probability
at most $\varepsilon$ that $x \in S$ (a "false positive"). For simplicity we will assume that $\varepsilon$ is a negative
power of 2.

The model of computation is a unit cost RAM with a standard instruction set. For simplicity we 
assume that a key fits in a single machine word, and that associated values are no larger than keys. 
Some results will assume free access to fully random hash functions, such that any function value 
can be computed in constant time. (This is explicitly stated in such cases.) 

1.2 Motivation 

The approximate membership problem has attracted significant interest in recent years due to a 
number of applications, mainly in distributed systems and database systems, where false positives 
can be tolerated and space usage is crucial (see [3] for a survey). Often the false positive probability
that can be tolerated is relatively large, say, in the range 1% to 10%, which entails that the space
usage can be made much smaller than what would be required to store S exactly. 

The retrieval problem shows up in situations where the amount of data associated with each 
key is small, and it is either known that queries will only be asked on keys in S, or where the 
answers returned for keys not in S do not matter. As an example, suppose that we have ranked the 
URLs of the World Wide Web on a $2^r$-step scale, where $r$ is a small integer. Then a retrieval data
structure would be able to provide the ranking of a given URL, without having to store the URL 
itself. The retrieval problem is also the key to obtaining a space-optimal RAM data structure that 
is able to answer range queries in constant time [U [25] . 
1.3 Previous results



Approximate membership. The study of approximate membership was initiated by Bloom [5],
who described the Bloom filter data structure, which provides an elegant, near-optimal solution to
the problem: The data structure is a bit array. Use $k = \log_2(1/\varepsilon)$ hash functions to associate each key
with $k$ randomly located bits in the array. Set these bits to 1 for all $x \in S$, and put 0s elsewhere. On
a query for $x$, the Bloom filter reports that $x \in S$ if and only if all bits associated with $x$ are 1. Bloom
showed¹ that a space usage of $n \log_2(1/\varepsilon) \log_2 e$ bits suffices for a false positive probability of $\varepsilon$.
Carter et al. [9] showed that $n \log_2(1/\varepsilon)$ bits are required for solving the approximate membership
problem when $|U| \gg n$ (see Appendix C for details). Thus, the analysis of [5] shows that Bloom
filters have space usage within a factor $\log_2 e \approx 1.44$ of the lower bound, which is tight.
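
To make the description concrete, the following Python sketch builds a Bloom filter along the lines above. It is only an illustration under stated assumptions: the class name `BloomFilter`, the use of SHA-256 with an index prefix in place of $k$ fully random hash functions, and the one-byte-per-bit array are choices made here for readability and are not taken from [5].

```python
import hashlib
import math


class BloomFilter:
    """Textbook Bloom filter: k hash functions sharing one bit array."""

    def __init__(self, keys, eps):
        keys = list(keys)
        # k = log2(1/eps) hash functions (eps is a negative power of 2).
        self.k = max(1, round(math.log2(1 / eps)))
        # n * log2(1/eps) * log2(e) array positions, as in Bloom's analysis.
        self.m = max(1, math.ceil(len(keys) * self.k * math.log2(math.e)))
        self.bits = bytearray(self.m)  # one byte per bit, for clarity only
        for x in keys:
            for pos in self._positions(x):
                self.bits[pos] = 1

    def _positions(self, x):
        # Assumption: SHA-256 with an index prefix stands in for
        # k independent fully random hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def __contains__(self, x):
        # Report membership iff all k associated bits are 1.
        return all(self.bits[pos] for pos in self._positions(x))
```

With this sketch, `x in BloomFilter(keys, 2 ** -4)` is always true for `x` in `keys`, and true for other strings with probability roughly $2^{-4}$ under the stated assumptions.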

Another approach to approximate membership is perfect hashing. A minimal perfect hash
function for $S$ maps the keys of $S$ bijectively to $[n] = \{0, \ldots, n-1\}$. Hagerup and Tholey [19] showed
how to store a minimal perfect hash function $h$ in a data structure of $n \log_2 e + o(n)$ bits such that
it can be evaluated on a given input in constant time. This space usage is the best possible, up to
the lower order term. Now store an array of $n$ entries where, for each $x \in S$, entry $h(x)$ contains
a $\log_2(1/\varepsilon)$-bit hash signature $q(x)$. When looking up a key $x$, we answer $x \in S$ if and only if the
hash signature at entry $h(x)$ is equal to $q(x)$. The origin of this idea is unknown to us, but it is
described e.g. in [3]. The space usage for the resulting data structure differs from the lower bound
$n \log_2(1/\varepsilon)$ by the space required for the minimal perfect hash function, and improves upon Bloom
filters when $\varepsilon < 2^{-4}$ and $n$ is sufficiently large.
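
The comparison between the two approaches can be checked with a few lines of arithmetic. The Python sketch below computes bits per key for both; it ignores the $o(n)$ term of the perfect hash function (an assumption that is only reasonable for large $n$), and the function names are hypothetical.

```python
import math


def bloom_bits_per_key(eps):
    """Bloom filter: log2(1/eps) * log2(e) bits per key."""
    return math.log2(1 / eps) * math.log2(math.e)


def perfect_hash_bits_per_key(eps):
    """Signature array plus minimal perfect hash function,
    log2(1/eps) + log2(e) bits per key (the o(n) term is ignored)."""
    return math.log2(1 / eps) + math.log2(math.e)
```

For $\varepsilon = 2^{-5}$ the perfect hashing approach needs fewer bits per key than the Bloom filter, while for $\varepsilon = 2^{-3}$ the Bloom filter is smaller.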

Mitzenmacher [21] considered the encoding problem where the task is to represent and transmit 
an approximate set representation (no fast queries required). However, even in this case existing 
techniques have a space overhead similar to that of the perfect hashing approach. 

Retrieval. The retrieval problem has traditionally been addressed through the use of perfect 
hashing. Using the Hagerup-Tholey data structure yields a space usage of $nr + n \log_2 e + o(n)$ bits
with constant query time. 

Recently, Chazelle et al. [11] presented a different approach to the problem based on an idea
similar to that of a Bloom filter: Each key is associated with $k = O(1)$ locations in an array with
$O(n)$ entries of $r$ bits. The answer to a retrieval query on $x$ is found by combining the values
of entries associated with $x$, using bitwise XOR. In place of the XOR operation, any abelian
group operation may be used. In fact, this idea was used earlier by Majewski, Wormald, Havas,
and Czech [22] and by Seiden and Hirschberg [31] to address the special case of order-preserving
minimal perfect hashing. It is not hard to see that these data structures in fact solve the retrieval
problem. The main result of [25] is that for $k = 3$ a space usage of around $1.23nr$ bits is possible,
and this is the best possible using the construction algorithm of [11, 22] (other values of $k$ give
worse results).

The approach of these papers does not give a data structure that is more efficient than perfect 
hashing, asymptotically for large n, but the simplicity and the lack of lower order terms in the 
space usage that may dominate for small n makes it interesting from a practical viewpoint. A 
particular feature is that (like for Bloom filters) all memory lookups are nonadaptive, i.e., the 
memory addresses can be determined from the query only. This can be exploited by modern CPU 
architectures that are able to parallelize memory lookups (see e.g. [33]). 

In fact, Chazelle et al. also show how approximate membership can be incorporated into their
data structure by extending array entries to $r + \log_2(1/\varepsilon)$ bits. This generalized data structure is
called a Bloomier filter. Again, the space usage is a constant factor higher, asymptotically, than
the solution based on perfect hashing.

1 Bloom used a certain simplifying assumption, independence of certain slightly correlated events, that has since
been justified, see [3].

1.4 New contributions 

Our first contribution shows that the approach of [22, 31] can be used to achieve space for
retrieval that is very close to the lower bound: 

Theorem 1 For any $\gamma > 0$, $r = O(\log n)$, and any sufficiently large $n$ there exist data structures
for the retrieval problem having the following space and time complexity on a unit cost RAM with
free access to a fully random hash function:

(a) Space $nr + O(\log\log n)$ bits whp², query time $O(\log n)$, expected construction time $O(n^3)$.

(b) Space $(1 + \gamma)nr$ bits, query time $O(1 + \log(1/\gamma))$, expected construction time $O(n)$.

Our basic data structure and query evaluation algorithm are the same as in [11, 22]. The new
contribution is to analyze a different construction algorithm (suggested in [31]) that is able to
achieve a better space usage. Our analysis needs tools and theorems from linear algebra, while that
of [11, 22] is based on random graph theory ([31] provided only experimental results). To get a
data structure that allows expected linear construction time we devise a new variant of the data 
structure and query evaluation algorithm, retaining simplicity and non-adaptivity. 

Our second contribution is to point out an intimate connection between the approximate
membership problem and the retrieval problem:

Theorem 2 Assuming free access to fully random hash functions, any static retrieval data structure
can be used to implement an approximate membership data structure having false positive probability
$2^{-r}$, with no additional cost in space, and $O(1)$ extra time. This reduction is near-optimal in the
sense that it can be used to solve the approximate membership problem in space that is within
$O(\log\log n)$ bits of optimal whp.

The papers on Bloom filters, and the papers on retrieval [22] all make the assumption of 
access to fully random hash functions, as in the above. We show how our data structures can be 
realized on a RAM, with a small additional cost in space: 

Theorem 3 In the setting of Theorem 1, for some $\varepsilon > 0$, we can avoid the assumption of fully
random hash functions and get data structures with the following space and time complexities:

(a) Space $nr + O(n^{1-\varepsilon})$ bits, query time $O(\log n)$, expected construction time $O(n^{1+\gamma})$.

(b) Space $(1 + \gamma)nr$ bits, query time $O(1 + \log(1/\gamma))$, expected construction time $O(n)$.

Our results have a couple of other implications in data structures. We improve the space usage
of a recent simple construction of (minimal) perfect hashing of Botelho et al. [4] (Section 7). In
addition, we show a close relationship between "cuckoo hashing"-like dictionaries and retrieval
structures (Section 6). This implies improved upper bounds on the space usage of $k$-ary cuckoo
hashing [18] (or equivalently, of the 1-orientability threshold of a $k$-uniform random hypergraph
with $n$ edges).

2 "whp." means with probability 1 — 0( ■> ). 
1.5 Overview of paper

Section 2 describes our basic retrieval data structure and its analysis, using a result due to Calkin [8].
This leads to part (b) of Theorem 1, except that the construction time is $O(n^3)$. Part (a) is shown
in Section 3, using a result of Cooper [13]. The reduction of approximate membership to retrieval,
Theorem 2, is presented in Section 4. Section 5 completes the proof of part (b) of Theorem 1 by
showing how the construction algorithm can be made to run in linear time. Section 7 describes an
application of our results to perfect hashing. Some issues, such as circumventing the full randomness
assumption, leading to Theorem 3, are discussed in the appendices.

2 Retrieval in constant time and almost optimal space 

In this section, we give the basic construction of a data structure for retrieval with constant time 
lookup operation and $(1 + \delta)nr$ space. As a technical basis, we start with describing a result by
Calkin [8] regarding the probability that 0-1-matrices with sparse rows chosen randomly have full 
row rank. 

2.1 Calkin's results 

All calculations are over the field $GF(2) = \mathbb{Z}_2$ with 2 elements. We consider binary matrices
$M = (p_{ij})_{1 \le i \le n,\, 0 \le j < m}$ with $n$ rows and $m$ columns. If $M$ is such a matrix, then row vector
$(p_{i,0}, \ldots, p_{i,m-1})$ is called $p_i$, for $1 \le i \le n$.

Theorem 4 (Calkin [8, Theorem 1.2]) For every $k > 2$ there is a constant $\beta_k < 1$ such that
the following holds: Assume the $n$ rows $p_1, \ldots, p_n$ of a matrix $M$ are chosen at random from the
set of binary vectors of length $m$ and weight (number of 1s) exactly $k$. Then:

(a) If $n/m < \beta < \beta_k$, then $\Pr(M \text{ has full row rank}) \to 1$ (as $n \to \infty$).

(b) If $n/m > \beta > \beta_k$, then $\Pr(M \text{ has full row rank}) \to 0$ (as $n \to \infty$).

Furthermore, $\beta_k - (1 - e^{-k}/\ln 2) \to 0$ for $k \to \infty$ (exponentially fast in $k$).

Remark 1 It has been noted earlier in related work [22] that the question whether a matrix with
$m$ columns and randomly chosen rows of weight 2 has full row rank is equivalent to the question
whether the graph with $M$ as its vertex-edge incidence matrix is acyclic. The threshold value for
this case is $\beta_2 = 1/2$, as is well known from the theory of random graphs. In [22] and [lj it is explored
how this fact can be used for constructing perfect hash functions, in a way that implicitly includes
the construction of retrieval structures.

Remark 2 A closer look into the proof of Theorem 1.2 in [8] reveals that for each $k$ there is
some $\varepsilon = \varepsilon_k > 0$ such that in the situation of Theorem 4(a) we have $\Pr(M \text{ has full row rank})$ =
1 — 0(n~ e ). The following values are suitable: £3 = |, £4 = |, = 1 for k > 5. 

According to [8], the threshold value $\beta_k$ is characterized as follows: Define

$f(\alpha, \beta) = -\ln 2 - \alpha \ln \alpha - (1 - \alpha)\ln(1 - \alpha) + \beta \ln(1 + (1 - 2\alpha)^k)$, (1)

for $0 < \alpha < 1$. Let $\beta_k$ be the minimal $\beta$ so that $f(\alpha, \beta)$ attains the value 0 for some $\alpha \in (0, \frac{1}{2})$.
Using a computer algebra system, it is easy to find approximate values for small $k$, see Table 1.
$k$                 3         4         5         6
$\beta_k$           0.88949   0.96714   0.98916   0.99622
$\beta_k^{appr}$    0.9091    0.9690    0.9893    0.99624
$\beta_k^{-1}$      1.1243    1.034     1.011     1.0038

Table 1: Approximate threshold values from Theorem 4, using (1) and (2).

This table also lists upper bounds for the reciprocals $\beta_k^{-1}$, since these are the figures we will utilize
later. Calkin further proves that

$\beta_k = 1 - \frac{e^{-k}}{\ln 2} - \frac{1}{2 \ln 2}\left(k^2 - 2k + \frac{2k}{\ln 2} - 1\right) e^{-2k} \pm O(k^4) \cdot e^{-3k}$, (2)

as $k \to \infty$. It seems that the approximation obtained by omitting the last term in (2) is quite good
already for small values of $k$. (See the row for $\beta_k^{appr}$ in Table 1.)
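
The threshold values can be reproduced numerically. Solving $f(\alpha, \beta) = 0$ in (1) for $\beta$ gives $\beta = (\ln 2 + \alpha \ln \alpha + (1-\alpha)\ln(1-\alpha)) / \ln(1 + (1-2\alpha)^k)$, so $\beta_k$ is the minimum of this expression over $\alpha \in (0, \frac{1}{2})$. The following Python sketch minimizes it on a uniform grid (a crude stand-in for a computer algebra system; the function names and the grid size are assumptions made here). It also evaluates the expansion (2) without its error term, read as $1 - e^{-k}/\ln 2 - (k^2 - 2k + 2k/\ln 2 - 1)\,e^{-2k}/(2 \ln 2)$.

```python
import math


def beta_solving_f(alpha, k):
    """The beta for which f(alpha, beta) = 0 in equation (1)."""
    entropy = -alpha * math.log(alpha) - (1 - alpha) * math.log(1 - alpha)
    return (math.log(2) - entropy) / math.log1p((1 - 2 * alpha) ** k)


def beta_threshold(k, steps=50_000):
    """Approximate beta_k by minimising over a grid of alpha in (0, 1/2)."""
    return min(beta_solving_f(i / (2 * steps), k) for i in range(1, steps))


def beta_approx(k):
    """Expansion (2) with the O(k^4) e^{-3k} term omitted."""
    ln2 = math.log(2)
    second = (k * k - 2 * k + 2 * k / ln2 - 1) * math.exp(-2 * k) / (2 * ln2)
    return 1 - math.exp(-k) / ln2 - second
```

Calling `beta_threshold(k)` and `beta_approx(k)` for $k = 3, \ldots, 6$ gives values that can be compared with the corresponding rows of the table of threshold values.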

Remark 3 Results similar to those of Calkin [8], but for a different model, were obtained 
independently by Balakin, Kolchin, and Khokhlov [2[ |2U[ I2T]. Further results in a similar vein can 
be found in a paper by Cooper [12].

2.2 The basic retrieval data structure 

Now we are ready to describe a retrieval data structure. Assume $f : S \to \{0,1\}^r$ is given, for a set
$S = \{x_1, \ldots, x_n\}$. For a given (fixed) $k \ge 3$ let $1 + \delta > \beta_k^{-1}$ be arbitrary and let $m = (1 + \delta)n$.
We can arrange the lookup time to be $O(k)$ and the number of bits in the data structure to be
$mr = (1 + \delta)nr$ plus lower order terms.³

We assume that we have access to $k$ hash functions with ranges $[m], \ldots, [m - k + 1]$ that behave
fully randomly on the keys of $S$, and that, in case the construction below fails, we may switch
to a new independent set of $k$ hash functions, again random on $S$. It is not hard to see that this
assumption makes it possible to define a mapping $U \ni x \mapsto A_x \in \binom{[m]}{k}$, where $\binom{X}{k}$ denotes the
set of all subsets of $X$ with $k$ elements, so that computing $A_x$ from $x \in U$ takes time $O(k)$, and so
that $(A_x)_{x \in S}$ is fully random on $S$. (For details see Appendix A.) We need to store a few bits to
record which set of hash functions was used in the successful construction.

The construction starts from $S = \{x_1, \ldots, x_n\}$ and the bit strings $u_i = f(x_i) \in \{0,1\}^r$, $1 \le i \le n$.
We consider the matrix

$M = (p_{ij})_{1 \le i \le n,\, 0 \le j < m}$, with $p_{ij} = 1$ if $j \in A_{x_i}$ and $p_{ij} = 0$ otherwise. (3)

Theorem 4(a) (with Remark 2) says that $M$ has full row rank with probability $1 - O(n^{-\varepsilon})$ for some
e > 0. Assume n is so large that this happens with probability at least |. 

If $M$ does have full row rank, the column space of $M$ is all of $\{0,1\}^n$, hence for all $u \in \{0,1\}^n$
there is some $a \in \{0,1\}^m$ with $M \cdot a = u$. More generally, we arrange the bit strings
$u_1, \ldots, u_n \in \{0,1\}^r$ as a column vector $u = (u_1, \ldots, u_n)^T$. We stretch notation a bit (but in a natural way) so
that we can multiply binary matrices with vectors of $r$-bit strings: multiplication is just bit/vector
multiplication and addition is bitwise XOR. It is then easy to see, working with the components of
the $u_i$ separately, that there is a (column) vector $a = (a_0, \ldots, a_{m-1})^T$ with entries in $\{0,1\}^r$ such
that $M \cdot a = u$. We can rephrase this as follows (using $\oplus$ as notation for bitwise XOR): For
$a \in (\{0,1\}^r)^m$ and $x \in U$ define

$h_a(x) = \bigoplus_{j \in A_x} a_j$. (4)

Then for an arbitrary sequence $(u_1, \ldots, u_n)$ of prescribed values from $\{0,1\}^r$ there is some
$a \in (\{0,1\}^r)^m$ with $h_a(x_i) = u_i$, for $1 \le i \le n$. Such a vector $a \in (\{0,1\}^r)^m$, together with an identifier
for the set of hash functions used in the successful construction, is a data structure for retrieving
value $u_i = f(x_i)$, given $x_i$. There are $k$ accesses to the data structure, plus the effort to evaluate $k$
hash functions on $x$ and calculate the set $A_x$ from $x$ (see Appendix A).

3 For simplicity, in the notation we suppress rounding nonintegral values to a suitable near integer.

Remark 4 A similar construction (over arbitrary fields $GF(q)$) was described by Seiden and
Hirschberg [31]. However, those authors did not have Calkin's results, and so could not give
theoretical bounds on the number $m$ of columns needed. Also, our construction generalizes the
approach of Chazelle et al. [11, 22], who required that $M$ could be transformed into echelon form
by permuting rows and columns, which is sufficient, but not necessary, for $M$ to have full row rank.

Some details of the construction are missing. We describe one of several possible ways to proceed.
From $S$, we first calculate the sets $A_{x_i}$, $1 \le i \le n$, in time $O(n)$. Using Gaussian elimination,
we can check whether the induced matrix $M = (p_{ij})$ has full row rank. If this is not the case, we
start all over with a new set of $k$ hash functions, leading to new sets $A_{x_i}$. This is repeated until
a suitable matrix $M$ is obtained. The expected number of repetitions is $1 + O(n^{-\varepsilon})$. For a matrix
$M$ with independent rows Gaussian elimination will also yield a "pseudoinverse" of $M$, that is,
an invertible $n \times n$-matrix $C$ (coding a sequence of elementary row transformations without row
exchanges) with the property that in $C \cdot M$ the $n$ unit vectors occur as columns:

$\forall i,\ 1 \le i \le n,\ \exists\, b_i \in [m]$: column $b_i$ of $C \cdot M$ equals $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^T$. (5)

Given $u = (u_1, \ldots, u_n)^T \in \{0,1\}^n$ we wish to find a solution $a \in \{0,1\}^m$ of the system

$(C \cdot M) \cdot a = C \cdot u = u' = (u'_1, \ldots, u'_n)^T$. (6)

Since $C \cdot M$ has the unit vectors in columns $b_1, \ldots, b_n$, we can easily read off a special $a$ that
solves (6): Let $a_j = 0$ for $j \notin \{b_1, \ldots, b_n\}$, and let $a_{b_i} = u'_i$ for $1 \le i \le n$. Exactly the same formula
works if $u$, $u'$, and $a$ are vectors of $r$-bit strings. We have established the following.

Theorem 5 Assume that fully random hash functions from keys $x \in S$ to ranges $[m], \ldots, [m - k + 1]$
are available (with the option to choose such functions repeatedly and independently). Let $k > 2$ be
fixed, and let $1 + \delta > \beta_k^{-1}$.

Then for $n$ large enough the following holds: Given $S = \{x_1, \ldots, x_n\}$ and a sequence $(u_1, \ldots, u_n)$
of prescribed elements in $\{0,1\}^r$, we can find a vector $a = (a_0, \ldots, a_{m-1})$ with elements in $\{0,1\}^r$
such that $h_a(x_i) = u_i$, for $1 \le i \le n$.

The expected construction time is $O(n^3)$, the scratch space needed is $O(n^2)$.
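
As an illustration of Theorem 5, the following Python sketch builds the vector $a$ and evaluates the XOR of the $k$ entries selected by $A_x$. It departs from the description above in ways we state explicitly: a seeded `random.Random` stands in for the fully random hash functions, rows are eliminated incrementally and $a$ is obtained by back-substitution rather than via an explicit pseudoinverse $C$, and the default $\delta = 0.2$ leaves more slack than the asymptotic threshold for $k = 3$ requires. All function names are hypothetical.

```python
import random


def positions(x, m, k, seed):
    """A_x: k distinct positions in range(m).
    Assumption: a seeded PRNG stands in for fully random hash functions."""
    return random.Random(f"{seed}:{x}").sample(range(m), k)


def build(f, k=3, delta=0.2, max_tries=20):
    """Return (a, m, seed) such that XOR of a[j], j in A_x, equals f[x]."""
    n = len(f)
    m = max(k, int((1 + delta) * n) + 1)
    for seed in range(max_tries):
        pivots = {}  # pivot column -> (row bitmask, right-hand side)
        independent = True
        for x, value in f.items():
            mask = 0
            for j in positions(x, m, k, seed):
                mask |= 1 << j
            # Eliminate over GF(2) until the row has a fresh pivot column.
            while mask:
                p = (mask & -mask).bit_length() - 1  # lowest set column
                if p not in pivots:
                    pivots[p] = (mask, value)
                    break
                pivot_mask, pivot_value = pivots[p]
                mask ^= pivot_mask
                value ^= pivot_value
            if mask == 0:  # linearly dependent row: retry with a new seed
                independent = False
                break
        if not independent:
            continue
        a = [0] * m  # entries outside the pivot columns stay 0
        for p in sorted(pivots, reverse=True):
            mask, value = pivots[p]
            rest = mask ^ (1 << p)  # remaining columns are all larger than p
            while rest:
                j = (rest & -rest).bit_length() - 1
                value ^= a[j]
                rest &= rest - 1
            a[p] = value
        return a, m, seed
    raise RuntimeError("no full-rank matrix found")


def query(a, m, k, seed, x):
    """Evaluate the XOR of the k table entries associated with x."""
    result = 0
    for j in positions(x, m, k, seed):
        result ^= a[j]
    return result
```

For a dictionary `f` mapping keys to $r$-bit integers, `a, m, seed = build(f)` followed by `query(a, m, 3, seed, x)` returns `f[x]` for every key `x`; for other inputs the result is an arbitrary $r$-bit value.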

Remark 5 We note that only $n$ entries of $a$ are significant, namely entries $b_1, \ldots, b_n$. The other
entries can be chosen arbitrarily, for example equal to 0. This observation implies that there is
an alternative data structure whose redundancy is independent of $r$. A constant time rank data
structure (e.g. Theorem 4.4 in [27]) can be used to identify the entries in $\{b_1, \ldots, b_n\}$ and map them
to entries in a "compressed" array of size $n$. The space usage of the rank data structure is within
a lower order term of the entropy of the set $\{b_1, \ldots, b_n\}$, which is $\log_2 \binom{m}{n}$. The drawback of this
is that accesses to the compressed array are now adaptive, as they depend on lookups in the rank
data structure.
Remark 6 At first glance, the time complexity of the construction seems to be forbiddingly
large. However, using a trick described in Appendix B ("split-and-share") makes it possible to
obtain a data structure with the same functionality and space bounds (up to an $o(n)$ term) in time
$O(n^{1+\gamma})$ for any given $\gamma > 0$. In Section 5 we show how to construct a retrieval structure with
essentially the same space requirements in expected linear time.

3 A retrieval structure with optimal space 

The purpose of this section is to prove Theorem 1(a), i.e., to show that there is a data structure
supporting retrieval that requires only space $nr + O(\log\log n)$ bits whp. (note that $nr$ bits is a
lower bound), and in which one retrieval operation takes logarithmic time. More precisely, $O(\log n)$
table entries are read and combined by XOR. The idea is the following. We use the same setup as
in Section 2.2, except that the range size $m$ is equal to $n$, and that $k$, the size of the sets $A_x$,
is chosen to be $O(\log n)$. We set up the matrix $M$ as in (3). The rows correspond to the $n$ keys
$x_1, \ldots, x_n$ in $S$, the columns to the range $[n]$. In order to argue that the induced $n \times n$-matrix $M$
has full rank at least with constant probability, we wish to use the following theorem.

Theorem 6 (Cooper [13, Theorem 2(a)]) Let $M = (p_{ij})_{1 \le i \le n,\, 0 \le j < n}$ be an $n \times n$-matrix filled
with 0s and 1s in the following way: The entries are chosen independently, and each entry $p_{ij}$ is 1
with probability $p = p(n) = (c+1)\log(n)/n$, and 0 with probability $1 - p$, where $c > 0$ is an arbitrary
constant. Then $\lim_{n \to \infty} \Pr(M \text{ is regular}) = c_2$, where $c_2 = \prod_{i \ge 1} (1 - 2^{-i}) \approx 0.28879$.

Note. $c_2$ is the probability that a random 0-1-matrix, each entry being 0 or 1 with probability
$\frac{1}{2}$, is regular. We will work with the constant $c = 1$ or $p = 2\log(n)/n$ throughout, but any other
constant would do as well. The statement of the theorem in Cooper's paper is even more general.

A slight difficulty arises in that the number of 1s in a row is fixed to be $k$ in the setup of
Section 2.2, and is binomially distributed in Cooper's theorem. The idea to resolve this is as follows:
The size $k(x)$ of set $A_x$, for $x \in S$, is chosen at random according to the binomial distribution. Then
the rows of $M$ will have weight $O(\log n)$ with high probability, as noted in the following lemma,
which is easy to prove by Chernoff bounds, e.g., [23, Theorems 4.4 and 4.5].

Lemma 7 In the situation of Theorem 6 with $c = 1$, the probability that $M$ has a row in which
there are more than $4\log n$ or fewer than $\frac{1}{2}\log n$ 1s is

3.1 Sampling the binomial distribution 

Using one extra hash function $q$ (with range $[n^3]$, say), for each $x \in S$ we can choose a number
$k(x)$ at random from the binomial distribution conditioned on $[\frac{1}{2}\log n, 4\log n]$. Then Lemma 7
implies that the deviation in probability from the situation of Theorem 6 is $o(1)$. Specifically,
assume $n$ and $p = 2\log(n)/n$ are given, and that a fully random hash function $g$ with range
$[n^4]$ is available. We wish to sample from $B(n,p)$ conditioned on $[\frac{1}{2}\log n, 4\log n] = [\frac{1}{4}np, 2np]$,
at least approximately. For this, we prepare a table of all $O(\log n)$ values $F(i) = \Pr(X \le i)$,
$\frac{1}{2}\log n \le i \le 4\log n$, for the corresponding distribution function $F$. For given $g(x)$ we find $i$ with
$\frac{1}{2}(F(i-1) + F(i)) < g(x)/n^4 \le \frac{1}{2}(F(i) + F(i+1))$, and return $i$. An easy calculation shows that
both B(re,p; \np) and B(re,p; ^p) are in [n _1 ,re -2 ], so that the error we make in comparison to the 
true binomial distribution is in $O(1/n^2)$, which for our purposes is negligible.
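
The table lookup can be sketched in Python as follows. This is a simplification under stated assumptions: SHA-256 stands in for the hash function $g$, $\log$ is taken to be the natural logarithm, and a plain inverse-CDF search over the renormalized table replaces the midpoint rule described above. The function names are hypothetical, and the floating-point table is only suitable for moderate $n$.

```python
import bisect
import hashlib
import math


def truncated_binomial_table(n):
    """Cumulative table of B(n, p), p = 2 ln(n)/n, conditioned on [lo, hi]
    with lo = ceil(ln(n)/2) and hi = floor(4 ln(n))."""
    p = 2 * math.log(n) / n
    lo = math.ceil(0.5 * math.log(n))
    hi = math.floor(4 * math.log(n))
    weights = [
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(lo, hi + 1)
    ]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total  # renormalise to the conditioned distribution
        cdf.append(acc)
    return lo, cdf


def row_weight(x, n, lo, cdf):
    """k(x): map a hash value in [n^4] to a row weight by inverse CDF."""
    # Assumption: SHA-256 stands in for the fully random hash function g.
    digest = hashlib.sha256(f"g:{x}".encode()).digest()
    g = int.from_bytes(digest, "big") % n**4
    u = g / n**4
    # Clamp guards against rounding when the table's last entry is below 1.
    return lo + min(bisect.bisect_left(cdf, u), len(cdf) - 1)
```

The table has $O(\log n)$ entries and is shared by all keys, so each evaluation of `row_weight` costs one hash evaluation plus a binary search.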

Having fixed $k(x)$, $x \in S$, using $4\log n$ further hash functions with ranges $[n - \ell + 1]$,
$1 \le \ell \le 4\log n$, for each $x \in S$ a set $A_x = \{h_1(x), h_2(x), \ldots, h_{k(x)}(x)\}$ is chosen at random
from $\binom{[n]}{k(x)}$. (For details see Appendix A.)
3.2 Putting things together

According to Theorem 6 the resulting matrix $M$ will have full rank with probability $0.28879 + o(1)$.
If it turns out $M$ is singular, we start all over, with a new set of $1 + 4\log n$ hash functions. With
$O(\log n)$ such trials, the probability that all resulting matrices $M$ are singular can be made as small
as $n^{-d}$ for an arbitrary constant $d$. The information which set of hash functions succeeds is part
of the data structure. Recording this information takes at most $O(\log\log n)$ bits. The remaining
details of the construction are the same as in Section 2.2.

Lookup works as follows: Given a key $x \in U$, use $q$ to calculate $k(x)$. If this happens to be outside
the range $[\frac{1}{2}\log n, 4\log n]$, return an arbitrary value. Otherwise calculate
$A_x = \{h_1(x), h_2(x), \ldots, h_{k(x)}(x)\}$ and return the value given by (4), with $k(x)$ in place of $k$.

This proves Theorem 1(a). We do not try to reduce the cost $O(n^3)$ for solving the linear system.
Using the "split-and-share" trick described in Appendix B we may avoid the assumption of fully
random hash functions and obtain a retrieval structure as described in Theorem 3(a).

4 Approximate membership 

We prove Theorem 2. Let $S \subseteq U$ with $|S| = n$ and an error bound $2^{-s}$ be given. We let
$q : U \to [2^s]$ be a fully random hash function. Using the methods from Section 2.2 or Section 3 we
build a retrieval structure $D$ that associates value $q(x)$ with $x \in S$. The space requirements for the
data structures and the times for construction and retrieval are inherited from the corresponding
retrieval structures.

A query for $x \in U$ returns "yes" if $D(x) = q(x)$ and "no" otherwise. It is then clear that a
query for an $x \in S$ always yields "yes". A query for an $x \in U - S$ yields "yes" with probability
$2^{-s}$, since $D(x)$ is given by the data structure, which is determined by $q(y)$, $y \in S$ (and some other
random choices), and hence is independent of $q(x)$.
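
The reduction is short enough to sketch in Python. The retrieval structure is passed in as a builder; `toy_retrieval` below is only a placeholder that stores the keys explicitly and returns 0 elsewhere, so it shows the interface but none of the space savings. SHA-256 stands in for the fully random function $q$, and all names are assumptions made for this illustration.

```python
import hashlib


def signature(x, s):
    """q(x): an s-bit signature (SHA-256 stands in for a random function)."""
    digest = hashlib.sha256(f"q:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << s)


def build_membership(keys, s, build_retrieval):
    """Turn any retrieval builder into an approximate membership tester."""
    # The retrieval structure D associates q(x) with every x in S.
    retrieve = build_retrieval({x: signature(x, s) for x in keys})
    # Answer "yes" iff D(x) == q(x).
    return lambda x: retrieve(x) == signature(x, s)


def toy_retrieval(f):
    """Placeholder retrieval structure: correct on keys, 0 elsewhere."""
    table = dict(f)
    return lambda x: table.get(x, 0)
```

Any structure that returns the stored value on keys and an arbitrary $s$-bit value elsewhere can be substituted for `toy_retrieval` without changing `build_membership`.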

Remark 7 Of course, we may combine the retrieval data structure of Section 2.2 with the
approximate membership data structure just described to obtain a data structure that needs space
$(1 + \delta)n(r + s)$ and has the functionality of a "Bloomier filter" as described in [11]: On a query for
$x \in S$, a prescribed value $f(x) \in \{0,1\}^r$ is returned, while for $x \in U - S$ the probability that some
value from $\{0,1\}^r$ (and not some error symbol) is returned is $2^{-s}$.

Remark 8 If we drop the assumption of fully random hash functions (like q) being provided for 
free, only a o(l) term has to be added to the false positive probability. This is briefly discussed in 
Remark 9 in Appendix B.

5 Retrieval in almost optimal space, with linear construction time 

In this section we show how, using a variant of the retrieval data structure described in Section 2.2,
we can achieve linear expected construction time and still get arbitrarily close to optimal space.
This will prove Theorem 1(b). The reader should be aware that the results in this section hold
asymptotically, only for rather large n. 

Using the notation of Sections 2.1 and 2.2, we fix some $k$ and some $\delta > 0$ such that $(1+\delta)\beta_k > 1$.
Further, some constant $\varepsilon > 0$ is fixed. We assume that $2k + 1$ fully random hash functions are at
our disposal, with ranges we can choose, and in case the construction fails we can choose a new set
of such functions, even repeatedly. (Appendix B explains how this can be justified.)
Define $b = \frac{1}{2}\sqrt{\log n}$. We assume that $\varepsilon$ and $\delta$ are so small that $(1 + \varepsilon)^2(1 + \delta) < 4$, and hence
that $b \cdot 2^{(1+\varepsilon)^2(1+\delta)b^2} = o(n/(\log n)^3)$.

Assume $f : S \to \{0,1\}^r$ is given, the value $f(x)$ being denoted by $u_x$. The global setup is as
follows: We use one fully random hash function $\varphi$ to map $S$ into the range $[m_0]$ with $m_0 = n/b$. In
this way, $m_0$ blocks $B_i = \{x \in S \mid \varphi(x) = i\}$, $0 \le i < m_0$, are created, each with expected size $b$.
The construction has two parts: a primary structure and a secondary structure for the "overflow
keys" that cannot be accommodated in the primary structure. This is similar to the global structure
of a number of well-known dictionary implementations. For the primary structure, we try to apply
the construction from Section 2.2 to each of the blocks separately, but only once, with a fixed set
of $k$ hash functions. This construction may fail for one of two reasons: (i) the block may be too
large (we do not allow more than $b' = (1 + \varepsilon)b$ keys in a block if it is to be treated in the primary
structure), or (ii) the construction from Section 2.2 fails because the row vectors in the matrix $M$,
induced by the sets $A_x$, $x \in B_i$, are not linearly independent.

For the primary structure, we set up a table $T$ with $(1+\delta)(1+\varepsilon)n$ entries, partitioned into
$m_0$ segments of size $(1+\delta)(1+\varepsilon)b = (1+\delta)b'$. Segment number $i$ is associated with block $B_i$.
If the construction from Section 2.2 fails, we set all the bits in segment number $i$ to 0 and use
the secondary structure to associate keys in $B_i$ with the correct values. As secondary structure we
choose a retrieval structure as in [11, 22], built on the basis of a second set of, say, 3 hash functions
(used to associate sets $A'_x \subseteq [1.3n']$ with the keys $x \in S'$) and a table $T'[0..1.3n'-1]$. This uses
space $1.3n'r$ bits, where $n'$ is the size of the set $S'$ of keys for which the construction failed (the
"overflow keys"). Of course, the secondary structure associates a value $f'(x)$ with any key $x \in S$.
Rather than storing information about which blocks succeed we compensate for the contribution
from $f'(x)$ as follows: If the construction succeeds for $B_i$, we store $(1+\delta)b'$ vectors of length $r$ in
segment number $i$ of table $T$ so that $x \in B_i$ is associated with the value $f(x) \oplus f'(x)$. On a query
for $x \in U$, calculate $i = \varphi(x)$, then the offset $d_i = (i-1)(1+\delta)b'$ of the segment for block $B_i$ in
$T$, and return

$$\bigoplus_{j \in A_x} T[j + d_i] \;\oplus\; \bigoplus_{j \in A'_x} T'[j],$$

$\oplus$ representing bitwise XOR in $\{0,1\}^r$. It is clear that for $x \in S$ the result will be $f(x)$: For $x \in S'$
the two terms are $0$ and $f(x)$, and for $x \in S \setminus S'$ the two terms are $f(x) \oplus f'(x)$ and $f'(x)$. Note that
the accesses to the tables are nonadaptive: all $k + 3$ lookups may be carried out in parallel. In fact,
if $T$ and $T'$ are concatenated this can be seen as the same evaluation procedure as in our basic
algorithm (4), the difference being that the hash functions were chosen in a different way (e.g., do
not all have the same range).
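
The evaluation rule can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the text: keys are small integers, the position sets $A_x$ and $A'_x$ are supplied by the hypothetical callables `primary_positions` and `secondary_positions` (standing in for the $k$ and 3 hash functions), table entries are $r$-bit integers, and blocks are numbered from 0, so the offset is taken as $i \cdot (1+\delta)b'$.

```python
from functools import reduce
from operator import xor

def query(x, phi, primary_positions, secondary_positions, T, T2, segment_len):
    """XOR the primary entries of x's segment with the secondary entries for x."""
    i = phi(x)                       # block index of x
    d = i * segment_len              # segment offset (0-based blocks: an assumption)
    primary = reduce(xor, (T[d + j] for j in primary_positions(x)), 0)
    secondary = reduce(xor, (T2[j] for j in secondary_positions(x)), 0)
    return primary ^ secondary

# Toy instance (all values hypothetical): two blocks with segments of length 4.
# Block 0 "failed", so its segment is all zeros and T2 alone carries f.
T = [0, 0, 0, 0, 5, 1, 0, 7]
T2 = [3, 6, 2]
phi = lambda x: x % 2
A = {4: [0, 1], 3: [0, 3]}           # positions inside the key's segment
A2 = {4: [0, 1], 3: [1, 2]}          # positions in the secondary table

f4 = query(4, phi, A.get, A2.get, T, T2, 4)   # overflow key: 0 ^ (3 ^ 6)
f3 = query(3, phi, A.get, A2.get, T, T2, 4)   # primary key: (5 ^ 7) ^ (6 ^ 2)
```

All lookups are independent of each other, which mirrors the nonadaptive access pattern noted above.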

Lemma 8 below says that $E(n') = o(n)$. Before proving the lemma we conclude the space
analysis assuming that it is true. The overall space is $(1+\delta)(1+\varepsilon)n(r + 1/b) + c|S'|r$ bits (apart
from lower order terms). If $\gamma > 0$ is given, we may choose $\varepsilon$ and $\delta$ (and $k$) so that this bound is
smaller than $(1+\gamma)nr$ for $n$ large enough.

We proceed to show that $|S'| = o(n)$ and how to achieve construction time $O(n)$.

Lemma 8 The expected number of overflow keys is o(n). 

Proof: It is sufficient to show that the expected number of blocks that have more than $(1+\varepsilon)b$
keys or for which the construction from Section 2.2 fails is $o(m_0)$. Let $x_0 \in S$ be an arbitrary key.
The number of keys colliding with $x_0$ is $B(n-1, \frac{1}{m_0})$-distributed; we may assume it is $B(n, \frac{1}{m_0})$-distributed.
The expected number of keys that collide with $x_0$ under $\varphi$ then is $n/m_0 \le b$. Since
$b' = (1+\varepsilon)b$, a standard Chernoff bound ([23, Thm. 4.4]) together with the fact that $b = \omega(1)$ yields
that $\Pr(x_0 \text{ collides with more than } b' \text{ keys}) \le e^{-\varepsilon^2 b/3} = o(1)$. Hence the expected number of keys in
overfull blocks is $o(n)$. Now assume a block $B_i$ has no more than $(1+\varepsilon)b$ keys, and the construction
from Section 2.2 is applied (once). There we noted, referring to Remark 2, that the probability that
the $b' \times (1+\delta)b'$-matrix induced by the sets $A_x$, $x \in B_i$, will have linearly dependent rows vanishes as $b'$ grows,
which is $o(1)$ again. Hence the expected number of keys in blocks that create matrices that do not
have full rank is $o(n)$. □

5.1 Matrix computations using tables 

As a building block in our construction algorithm for the primary structure we need some tables 
that allow us to do computations on small matrices efficiently. The details offer no surprises, but
are given here for completeness. 

Let $\varepsilon, \delta > 0$ be such that $(1+\varepsilon)^2(1+\delta) < 4$. For given $n$ define $b = \frac{1}{2}\sqrt{\log n}$, and let $b' = (1+\varepsilon)b$.
We want to set up auxiliary tables that help in dealing with binary matrices $M$ with $\le b'$ rows
and $(1+\delta)b'$ columns. There are no more than $b' \cdot 2^{(1+\varepsilon)^2(1+\delta)b^2} = o(n/(\log n)^3)$ such matrices. For
each such matrix $M$ we determine and store whether its rows are independent; if this is the case
we also calculate and store a pseudoinverse $C$, as described in Section 2.2. The overall space needed
for this table is $o(n)$ bits, and the total time to calculate the entries is $o(n)$ steps, even if a simple
method like Gaussian elimination is used.
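
The idea of tabulating the rank information for all small matrices can be illustrated as follows. This is a toy sketch, not the construction's actual table layout: rows are encoded as Python integers used as bit vectors, the names `rows_independent` and `build_rank_table` are hypothetical, and the pseudoinverse that the text also stores is omitted.

```python
from itertools import product

def rows_independent(rows):
    """Gaussian elimination over GF(2); each row is an int used as a bit vector."""
    pivots = {}                          # leading-bit position -> reduced row
    for row in rows:
        while row:
            top = row.bit_length() - 1   # position of the leading 1
            if top not in pivots:
                pivots[top] = row
                break
            row ^= pivots[top]           # cancel the leading 1
        if row == 0:
            return False                 # row reduced to zero: dependent
    return True

def build_rank_table(max_rows, cols):
    """Tabulate independence for every matrix with at most max_rows rows."""
    table = {}
    for n_rows in range(max_rows + 1):
        for rows in product(range(1 << cols), repeat=n_rows):
            table[rows] = rows_independent(rows)
    return table

table = build_rank_table(2, 3)           # all matrices with <= 2 rows, 3 columns
```

After the table is built, an independence test for any such matrix is a single dictionary lookup, which corresponds to the constant-time check used in the construction algorithm.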

In Section 2.2 we described what we mean by multiplying a matrix with a vector of bit strings.
For some integer constant $\ell > 3$ we prepare a table of all matrix-vector pairs $(L, v)$ and their
product $L \cdot v$, where the 0-1-matrix $L$ has $\le b'/\ell$ rows and $b'(1+\varepsilon)/\ell$ columns, and the vector
$v$ has $b'(1+\varepsilon)/\ell$ entries that are bit strings of length $(\log n)/\ell$. Such a table makes it possible to
multiply a matrix with $\le b'$ rows and $b'(1+\varepsilon)$ columns with a vector whose entries are bit strings
of length $O(\log n)$ in a number of word operations that depends only on $\ell$, hence in $O(1)$ time. This table has size $o(n)$ bits as
well.

5.2 Primary structure construction algorithm 

We are now ready to show the following lemma. 

Lemma 9 The primary structure can be constructed in time $O(n)$.

Proof: It is clear that linear time is sufficient to find the blocks $B_i$ and identify the blocks that are
too large. Now consider a fixed block $B_i$ of size at most $(1+\varepsilon)b$. We must evaluate $|B_i| \cdot k$ hash
functions to find the sets $A_x$, $x \in B_i$, and can piece together the matrix $M_i$ that is induced by
these sets in time $O(b)$ (assuming one can establish a word of $O(b)$ 0s in constant time and set a
bit in such a word given by its position in constant time). The whole matrix has fewer than $\log n$
bits and fits into a single word. This makes it possible to use the precomputed tables described
in Section 5.1. Similarly, using precomputed tables we can check in constant time whether $M_i$ has
linearly independent rows or not and in the positive case find a pseudoinverse $C_i$.

Now assume a bit vector $u = (u_1, \dots, u_{|B_i|})^T \in \{0,1\}^{|B_i|}$ is given. Using $C_i$ and a lookup table
as in Section 5.1 we can find $C_i \cdot u$ in constant time. A bit vector $a = (a_j)_{1 \le j \le (1+\delta)b'}$ that solves
$M_i \cdot a = u$ can then be found in time $O(b)$. This leads to an overall construction time of $O(n)$ for
the whole primary structure.

If the values in the range are bit vectors $f(x) = u_x \in \{0,1\}^r$, $x \in B_i$, a construction in
time $O(nr)$ follows trivially. We may improve this time bound as follows: The lookup tables from
Section 5.1 make it possible to multiply $C_i$ even with vectors whose entries are bit strings of length up to $O(\log n)$ in constant
time. This establishes a construction time of $O(n)$ for the general problem of representing a function
$f\colon S \to \{0,1\}^r$, using $r = O(\log n)$. □
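
For readers who want to see the per-block linear algebra in executable form, here is a minimal sketch that solves $M_i \cdot a = u$ over GF(2) by plain Gaussian elimination. It deliberately does not use the precomputed tables or word-level tricks from Section 5.1, so it does not achieve the stated time bounds; the function names and the toy data are assumptions made for illustration. Rows are bitmasks, and the $r$-bit values are Python integers combined by XOR.

```python
def build_table(position_sets, values, m):
    """Return T with XOR of T[j], j in A_x, equal to f(x); None if rows are dependent."""
    pivots = {}                              # leading column -> (row bitmask, value)
    for positions, val in zip(position_sets, values):
        row = 0
        for j in positions:
            row |= 1 << j                    # row of M as a bitmask over the m columns
        while row:
            top = row.bit_length() - 1
            if top not in pivots:
                pivots[top] = (row, val)
                break
            prow, pval = pivots[top]
            row ^= prow                      # eliminate the leading 1 ...
            val ^= pval                      # ... and apply the same step to the value
        if row == 0:
            return None                      # linearly dependent rows: construction fails
    T = [0] * m                              # non-pivot columns stay 0
    for top in sorted(pivots):               # lower pivot columns first
        row, val = pivots[top]
        rest = row ^ (1 << top)
        while rest:
            low = rest & -rest               # lowest remaining set bit
            val ^= T[low.bit_length() - 1]
            rest ^= low
        T[top] = val
    return T

def lookup(T, positions):
    out = 0
    for j in positions:
        out ^= T[j]
    return out

# Toy block: three keys, table of five cells, 3-bit values (hypothetical data).
sets = [(0, 1, 2), (1, 2, 3), (0, 3, 4)]
vals = [5, 3, 6]
T = build_table(sets, vals, 5)
```

Returning `None` on dependent rows matches the rule that a block whose matrix lacks full row rank is handed to the secondary structure.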

We have proved the following result, which is a precise and more general version of Theorem 1(b):

Theorem 10 There is an algorithm $A$ with the following properties. For every $\gamma > 0$ there is some
$k = O(\log(1/\gamma))$ such that for all sufficiently large $n$ the following holds: Given a set $S \subseteq U$ with
$n = |S|$ and a function $f\colon S \to \{0,1\}^r$, with probability $1 - o(1)$ algorithm $A$ will succeed in building
a data structure $D$ such that:

(a) $D$ supports retrieval in time $O(k)$, with no more than $2k + 1$ hash function evaluations and
$2k + 1$ (nonadaptive) random accesses into tables storing $r$-bit vectors or bits;

(b) the space occupied by $D$ is no more than $(1+\gamma)nr$ bits;

(c) $A$ runs in time $O(n)$.

6 Retrieval and dictionaries by balanced allocation 

In several recent papers, the following scenario for (statically) storing a set $S \subseteq U$ of keys was
studied. A set $S = \{x_1, \dots, x_n\} \subseteq U$ is to be stored in a table $T[0..m-1]$ of size $m = (1+\delta)n$
as follows: To each key $x$ we associate a set $A_x \subseteq [m]$ of $k$ possible table positions. Assume there
is a mapping $\sigma\colon \{1,\dots,n\} \to [m]$ that is one-to-one and satisfies $\sigma(i) \in A_{x_i}$, for $1 \le i \le n$.
(In this case we say $(A_x, x \in S)$ is suitable for $S$.) Choose one such mapping and store $x_i$ in
$T[\sigma(i)]$. Examples of constructions that follow this scheme are cuckoo hashing [28], $d$-ary cuckoo
hashing [18], blocked cuckoo hashing [16, 29], and perfectly balanced allocation [14]. In [6, 17]
threshold densities for blocked cuckoo hashing were determined exactly. These schemes are the
most space-efficient dictionary structures known, among schemes that store the keys explicitly in a
hash table. For example, $k$-ary cuckoo hashing [18] works in space $m = (1+\varepsilon_k)n$ with $\varepsilon_k = e^{-\Theta(k)}$.
Perfectly balanced allocation [14] works in optimal space $m = n$ with $A_x$ consisting of 2 contiguous
segments of $[n]$ of length $O(\log n)$ each.$^4$
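
Whether a given family of position sets is suitable can be checked by computing a bipartite matching between keys and table cells. The sketch below uses the classical augmenting-path method; it is an illustration only (the cited dictionary schemes use their own, more efficient insertion procedures), and the function name `find_placement` is an assumption.

```python
def find_placement(position_sets, m):
    """Return sigma with sigma[i] in position_sets[i], all distinct, or None."""
    owner = [None] * m                   # owner[j] = index of the key stored in cell j

    def try_place(i, seen):
        # Try to put key i somewhere, evicting other keys along an augmenting path.
        for j in position_sets[i]:
            if j in seen:
                continue
            seen.add(j)
            if owner[j] is None or try_place(owner[j], seen):
                owner[j] = i
                return True
        return False

    for i in range(len(position_sets)):
        if not try_place(i, set()):
            return None                  # the set system is not suitable
    sigma = [None] * len(position_sets)
    for j, i in enumerate(owner):
        if i is not None:
            sigma[i] = j
    return sigma

sets = [[0, 1], [0, 1], [1, 2]]          # hypothetical position sets, k = 2, m = 3
sigma = find_placement(sets, 3)
```

For the toy input every key receives its own cell; for two keys that share a single candidate cell the function reports failure by returning `None`.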

Here, we point out a close relationship between dictionary structures of this kind and retrieval
structures for functions $f\colon S \to R$, whenever the range $R$ is not too small. We will assume that
$R = F$ for a finite field $F$ with $|F| \ge n$. (Using a simple splitting trick, this condition can be
attenuated to $|F| \ge n^{\delta}$, see Appendix B.)

From Section 2.2 we recall equation (3) where the matrix $M = (p_{ij})_{1 \le i \le n,\, 0 \le j < m}$ was defined
from the sets $A_x$, $x \in S$.

Observation 1 Let $F$ be an arbitrary field. If the 1s in $M$ can be replaced by elements of $F$ in such
a way that the resulting matrix $M' = (p'_{ij})$ has full row rank over $F$, then $(A_x, x \in S)$ is suitable
for $S$.

Proof: If $M'$ has full row rank, it has an $n \times n$ submatrix $N$ with nonzero determinant. By
the definition of the determinant there must be a mapping $\sigma\colon \{1,\dots,n\} \to [m]$ with $\prod_i p'_{i,\sigma(i)} \neq 0$,
hence $p_{i,\sigma(i)} = 1$ for $1 \le i \le n$. ■

The observation implies that Calkin's bounds from Section 2 give upper space bounds for
dictionary constructions like $k$-ary cuckoo hashing. We note that the figures from Table 1 coincide
nicely with the space bounds found by experiments in [18], and are much tighter than the theoretical
results found there. Surprisingly, for fields that are not too small, the observation also works the
other way around: existence of a dictionary implies existence of a retrieval structure.

4 The result of [14] only directly applies when $n$ is divisible by the segment length. However, from the proof it is
clear that one also gets a perfectly balanced allocation in the case where segment sizes may differ by 1.
Theorem 11 Assume a mapping $x \mapsto A_x$ is given that is suitable for $S$. Then the following holds:
If $g_1, \dots, g_k\colon S \to F$ are random, then with probability at least $1 - \frac{n}{|F|}$ the matrix

$$M' = (p'_{ij})_{1 \le i \le n,\, 0 \le j < m}, \quad \text{with } p'_{ij} = g_\ell(x_i) \text{ if } j = h_\ell(x_i) \text{ and } p'_{ij} = 0 \text{ otherwise} \qquad (7)$$

has full row rank over $F$.

Proof: Define an auxiliary matrix $P = (s_{ij})$ of polynomials in variables $X_{ij}$, $1 \le i \le n$, by
$s_{ij} = X_{ij}$ if $j = h_\ell(x_i)$ for some $\ell$ and $s_{ij} = 0$ for all other $j$'s. Then the submatrix $P'$ of $P$ that corresponds
to the columns $\sigma(1), \dots, \sigma(n)$ of $P$ satisfies

$$\det(P') \neq \text{the 0-polynomial.}$$

This is because the mapping $i \mapsto \sigma(i)$ yields one term in the expansion of $\det(P')$ as a sum that
does not vanish, and because the terms in this sum cannot cancel in any way. By the Schwartz-Zippel
Theorem (see e.g. [26]) we know that if we substitute random elements $g_\ell(x_i)$ from $F$ for
the variables in $P'$, the probability that the resulting matrix $P'[X_{ij} \backslash g_\ell(x_i)]$ (with $j = h_\ell(x_i)$) is
regular is at least

$$1 - \deg(\det(P'))/|F| \ge 1 - n/|F| .$$

Clearly, if $P'[X_{ij} \backslash g_\ell(x_i)]$ has full rank, the extension $P[X_{ij} \backslash g_\ell(x_i)] = M'$ has full row rank. ■

The theorem implies the following: If the mapping $x \mapsto A_x$ is suitable for $S$, if $|F| \ge 2n$, and if
we have hash functions $g_1, \dots, g_k\colon U \to F$ that are random on $S$, then with probability at least $1/2$
we can build a retrieval structure for a function $f\colon S \to F$ consisting of a table $T[0..m-1]$ with
entries from $F$ with $f(x) = \sum_{1 \le \ell \le k} g_\ell(x) \cdot T[h_\ell(x)]$. If we can switch to new functions $g_1, \dots, g_k$ if
necessary, this construction succeeds in $O(\log n)$ iterations whp.
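
A small experiment makes the retry scheme concrete. The sketch below works over the prime field $F_p$ with $p = 101$ and stands in for the hash functions $h_\ell$ and $g_\ell$ with values drawn from a seeded pseudorandom generator; both choices, as well as all function names and the toy data, are assumptions made only for illustration. The linear system is solved by ordinary Gaussian elimination modulo $p$, and a new seed is tried whenever the matrix lacks full row rank.

```python
import random

def solve_mod_p(rows, rhs, m, p):
    """Solve rows * T = rhs over F_p; None if the rows are linearly dependent."""
    pivots = []                                   # (pivot column, normalized row)
    for coeffs, b in zip(rows, rhs):
        row = coeffs[:] + [b]
        for c, prow in pivots:                    # clear earlier pivot columns
            factor = row[c]
            if factor:
                row = [(a - factor * q) % p for a, q in zip(row, prow)]
        c = next((j for j in range(m) if row[j]), None)
        if c is None:
            return None
        inv = pow(row[c], -1, p)                  # modular inverse (Python 3.8+)
        pivots.append((c, [(a * inv) % p for a in row]))
    T = [0] * m                                   # free columns stay 0
    for c, prow in reversed(pivots):
        T[c] = (prow[m] - sum(prow[j] * T[j] for j in range(m) if j != c)) % p
    return T

def build(f, m, k, p, max_tries=50):
    keys = list(f)
    for seed in range(max_tries):                 # "switch to new functions"
        rng = random.Random(seed)
        h = {x: rng.sample(range(m), k) for x in keys}               # stand-in for h_1..h_k
        g = {x: [rng.randrange(p) for _ in range(k)] for x in keys}  # stand-in for g_1..g_k
        rows = []
        for x in keys:
            row = [0] * m
            for pos, coef in zip(h[x], g[x]):
                row[pos] = coef
            rows.append(row)
        T = solve_mod_p(rows, [f[x] % p for x in keys], m, p)
        if T is not None:
            return h, g, T
    raise RuntimeError("no suitable functions found")

def retrieve(x, h, g, T, p):
    return sum(c * T[j] for j, c in zip(h[x], g[x])) % p

f = {"ann": 7, "bob": 42, "cy": 0, "dee": 99, "eve": 13}   # hypothetical function
h, g, T = build(f, m=6, k=3, p=101)
```

Here $|F| = 101 \ge 2n$ for $n = 5$, so each attempt on a suitable set system succeeds with probability at least $1/2$ by the bound above.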

For example, from the dictionary constructions in [16] or [14], resp., we obtain retrieval structures
with a table of size $\le (1 + e^{-k})n$ and lookup time $O(k)$, or optimal size $n$ and lookup time
$O(\log n)$, resp. In both cases for one retrieval operation we need to access only two contiguous
segments of the table $T$, which makes these implementations very cache-friendly. Another, even
more cache-friendly, possibility is to do linear probing in a table of size $(1 + \Omega(1))n$, in which case
the lookup time can be bounded by $O(\log n)$ with high probability.$^5$

7 Application to perfect hashing 

The problem of perfect hashing is the following: Given $S \subseteq U$, construct a data structure that
makes it possible to calculate $h(x) \in [m]$ for $x \in U$ (fast), so that $h$ is 1-to-1 on $S$. We show that
the constructions of Chazelle et al. [11] and Botelho et al. [1], resp., which lead to space-efficient
representations that can be evaluated fast, can be enhanced so as to achieve a smaller space bound.

As before, we assume we have $k$ hash functions with ranges $[m], \dots, [m-k+1]$ that are fully
random on $S$. As described in Appendix A this yields a mechanism to calculate an ordered set
$A_x = \{h_1(x), \dots, h_k(x)\}$ of distinct values in $[m]$, for $x \in U$, so that $A_x$, $x \in S$, is fully random.
This set system can be regarded as a hypergraph. Matrix $M = (p_{ij})_{1 \le i \le n,\, 0 \le j < m}$ from (3) is the
vertex-edge incidence matrix of this hypergraph.

5 We were not able to find this fact in the literature, but from Chernoff bounds it is easy to see that with high
probability any interval of length $\ell \ge C \log n$, with a suitably large constant $C$ depending on $\alpha$, contains at most $\ell$
hash values. This implies that the maximum probe length is bounded by $\ell$.
Chazelle et al. [11] (implicitly) and Botelho et al. [1] have proposed a method for obtaining a
perfect hash function in this situation. If $M$ can be brought into echelon form by permuting rows
and columns, then each row has a distinguished column, namely the column in which the first 1
from the left in this row is positioned. The column index is an element of $A_x$. Thus, we obtain
a one-to-one mapping $\varphi$ from $\{1,\dots,n\}$ into $\{0,\dots,m-1\}$ so that $\varphi(i) \in A_{x_i}$. This means that
for each $i$ (or $x_i$) there is a unique $\lambda(i) \in \{0,\dots,k-1\}$ such that $\varphi(i) = h_{\lambda(i)+1}(x_i)$. If we can set
up a data structure that makes it possible to calculate $\lambda(i)$ from $x_i$, we can calculate a one-to-one
mapping from $S$ into $\{0,\dots,m-1\}$, that is, a perfect hash function. But providing such a data structure
is in essence the retrieval problem! We know how to solve this with a data structure with $\lceil \log k \rceil \cdot m$
bits. In the special case of a matrix that can be brought into echelon form by permuting rows and
columns the system of equations can even be solved in linear time.

Chazelle et al. [11] and Botelho et al. [1] gave different arguments why for $m$ sufficiently large
matrix $M$ should (with high probability) admit a transformation into upper triangular form by
row and column permutations. Chazelle et al. [11] gave ad hoc calculations leading to the bound
$m > kn$. Botelho et al. [4] used results from random (hyper)graph theory to state much smaller
bounds (no closed formula): $m > 1.222n$ for $k = 3$ and $m > 1.295n$ for $k = 4$, and so on, with the
constants growing for growing $k$. (The question asked in [4] is whether the hypergraph given by
$A_x$, $x \in S$, is "acyclic".) We avoid the use of random graph theory and resort to Calkin's theorem
(Theorem 4) to show that the bounds $\beta_k$ from Table 1 are relevant for this situation as well.
The disadvantage of our approach is that the algorithms that construct the data structure need
more time, since they involve Gaussian elimination. Again, the splitting trick from Appendix B can
alleviate this problem.

Assume the matrix $M$ has full row rank. We first calculate a pseudoinverse $C$ that satisfies eq.
(5) in Section 2.2. Since columns $b_1, \dots, b_n$ of $C \cdot M$ form a regular quadratic matrix, and $C \cdot M$ is
obtained from $M$ only by row transformations, columns $b_1, \dots, b_n$ of $M$ also form a regular matrix.
This means that the determinant of the submatrix of $M$ formed by these columns is nonzero;
hence, by the definition of the determinant as a sum of products over all permutations, there must be
a bijection $\varphi\colon \{1,\dots,n\} \leftrightarrow \{b_1,\dots,b_n\}$ with $\prod_i p_{i,\varphi(i)} \neq 0$, hence $p_{i,\varphi(i)} = 1$, for $1 \le i \le n$. This means
that $\varphi(i) \in A_{x_i}$, for $1 \le i \le n$. The mapping $\varphi$ may be found by an efficient algorithm to calculate
perfect matchings in bipartite graphs. For each $i$, from $\varphi(i)$ we obtain a value $\lambda(i) \in \{1,\dots,k\}$
such that $\varphi(i) = h_{\lambda(i)}(x_i)$.

We form a vector $(u_1, \dots, u_n)$ by defining $u_i$ to be (the binary representation of) $\lambda(i) - 1$, using
$r = \lceil \log k \rceil$ bits. Applying the construction from Section 2.2 we find a vector $a = (a_0, \dots, a_{m-1})$
with elements in $\{0,1\}^r$ such that $h_a(x_i) = \lambda(i) - 1$, for $1 \le i \le n$.

Then the function

$$h\colon U \to \{0,\dots,m-1\}, \quad x \mapsto h_{h_a(x)+1}(x)$$

is a perfect hash function for $S$ with range $\{0,\dots,m-1\}$. Evaluating the function amounts to
calculating $h_a(x)$ as given by (3). The function $h$ is represented by the table that contains the
components of $a$. This takes $m = (1+\delta)n$ words of $\lceil \log k \rceil$ bits, where $\delta > \beta_k^{-1} - 1$ is arbitrary.

Since $\beta_k^{-1} \approx 1 + e^{-k}/\ln 2$ for $k \to \infty$, the relative space overhead $\delta$ may be made as small as
we wish, at the cost of larger $k$. A particularly attractive choice is $k = 4$. Since $\beta_4^{-1} < 1.035$, we
could choose $m = 1.035n$ and spend $2m$ bits for the representation of the vector $a$, which amounts
to space requirements of $2.07n$ bits.

Botelho et al. [1] describe how a perfect hash function may be turned into a minimal perfect
hash function. There are several plausible techniques for this, one of them as follows: One stores the
set of locations in $\{0,\dots,m-1\} - h(S)$ in a succinct rank data structure [30]. This table requires
additional space of $0.035n \cdot \log_2(1.035/0.035) + n \cdot \log_2(1.035/1) + o(n) \approx 0.22n + o(n)$ bits. The total space
needed for the minimal perfect hash function is $2.29n + o(n)$ bits, which is a little better than the
$2.7n$ from [5]. The price to pay for this improvement is that to find the vector $a$ we must solve
a system of linear equations and solve an instance of the perfect matching problem, while in [3]
a very simple linear-time algorithm is sufficient. There is no big difference in the time needed to
evaluate the (minimal) perfect hash function.
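
The space figures for $k = 4$ can be rechecked with a few lines of arithmetic. The helper below is only a calculator for the per-key bit counts quoted above (lower-order $o(n)$ terms are ignored); its name and parameters are chosen here for illustration.

```python
from math import log2

def mphf_bits_per_key(bits_per_cell=2, load=1.035):
    """Per-key space of the table plus the rank structure, ignoring o(n) terms."""
    table = bits_per_cell * load                     # 2m bits with m = 1.035 n
    empty = load - 1.0                               # unused cells per key
    rank = empty * log2(load / empty) + 1.0 * log2(load / 1.0)
    return table, rank, table + rank

table_bits, rank_bits, total_bits = mphf_bits_per_key()
```

The three values come out close to 2.07, 0.22 and 2.29 bits per key, in agreement with the figures in the text.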

8 Open problems 

Our construction relies on either the assumption that the hash functions are fully random, or on 
the split-and-share construction. An obvious open problem (shared with most data structures that 
use multiple hash functions) is whether simpler hash functions can do the same job. 

Another question is to which extent the correspondence shown in Theorem 11 of Section 6 also
holds for small values of $r$. For example, is the space threshold for $k$-ary cuckoo hashing identical
to Calkin's threshold for random matrices with $k$ 1s in each row? Similarly, if we imitate blocked
cuckoo hashing [16] by restricting the sets $A_{x_i}$ to be a subset of the union of two intervals of size
$k$ (depending on $x_i$), what is the best space usage we can get?

Acknowledgement. The authors thank Philipp Woelfel for several motivating discussions on 
the subject. 
A Creating random sets of size k without repetitions

We briefly justify the assumption that given $k$ fully random hash functions with ranges we can
choose there is a way to map each key $x$ to a fully random sequence (or ordered set) $A_x =
(h_1(x), \dots, h_k(x))$ with all different values in $[m]$. Just take $k$ fully random hash functions $g_1, \dots, g_k$
where $g_\ell$ has range $[m - \ell + 1]$, for $\ell = 1, \dots, k$. The existence of the sequence $A_x$ is then easily
proved: For $\ell = 1, \dots, k$, let $h_\ell(x)$ be element number $g_\ell(x)$ in the set $[m] - \{h_1(x), \dots, h_{\ell-1}(x)\}$.
Algorithmically, it is simpler to work with an array $B[0..m-1]$, initialized with $B[j] = j$. Then,
sequentially for $\ell = 1, \dots, k$, the value at position $B[g_\ell(x)]$ is exchanged with the value at $B[m - \ell]$.
The output sequence is $(h_1(x), \dots, h_k(x)) = (B[m-1], \dots, B[m-k])$. Clearly, this is a random
sequence of $k$ distinct elements of $[m]$. If space is an issue, one might not actually use this array
$B$, but just simulate the effect. Time $O(k \log k)$ (using a search tree) or expected time $O(k)$ (using
hashing) is definitely sufficient. If the "split-and-share" approach of Appendix B is employed, the
space used for array $B$ is $O(n^{\varepsilon})$.
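
The simulated-array variant can be written down directly. In the sketch below the dictionary `moved` plays the role of the array $B$, storing only the entries that differ from the initial state $B[j] = j$; the function name and the convention that `g_values[l-1]` holds $g_\ell(x)$ are assumptions of this illustration.

```python
def distinct_positions(g_values, m):
    """g_values[l-1] must lie in range(m - l + 1); returns k distinct cells of range(m)."""
    moved = {}                               # sparse view of B: only changed entries

    def get(j):
        return moved.get(j, j)               # B[j] = j unless it was overwritten

    out = []
    for l, g in enumerate(g_values, start=1):
        last = m - l
        a, b = get(g), get(last)
        moved[g], moved[last] = b, a         # exchange B[g] and B[m - l]
        out.append(a)                        # B[m - l] now holds h_l(x)
    return out

positions = distinct_positions([3, 3, 0], 10)
```

With $g_1(x) = 3$, $g_2(x) = 3$, $g_3(x) = 0$ and $m = 10$, the exchanges yield the sequence $(3, 9, 0)$: the repeated value 3 is resolved because position 3 holds the value 9 after the first exchange. Each step costs expected constant time with a hash-based dictionary.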

B How to circumvent the full randomness assumption: 
"split-and-share" 

While it is quite common to assume that fully random hash functions are available for free in the 
context of Bloom filters and similar data structures, in the realm of dictionary implementations 
and construction of perfect hash functions one prefers to use randomization in the algorithm, viz., 
universal hashing. We briefly sketch how in the context of the static data structures of this paper 
universal hashing may be used to justify the full randomness assumptions. (For details, see [15].)

We assume the reader is familiar with the concept of a universal class of hash functions from
$U$ to $M$ (for distinct keys $x, y \in U$ and $h$ chosen at random from the class we have $\Pr(h(x) =
h(y)) \le 1/|M|$) as well as the concept of $k$-wise independent hash classes (for arbitrary distinct
keys $x_1, \dots, x_k \in U$ and $h$ chosen at random from the class the hash values $h(x_1), \dots, h(x_k)$ are
fully random in $M$). Constructions of such classes, with arbitrary ranges $M = [m]$, are well known,
see [10, 32].

Let $S \subseteq U$, $|S| = n$, be fixed. It is a common idea to use a hash function $h\colon U \to [m]$ to split
$S$ into "chunks" $S_i = \{x \in S \mid h(x) = i\}$, for $i \in [m]$, and then work on the "chunks" separately.
In our context, we would e.g. construct a retrieval structure for each $S_i$ separately in a dedicated
segment of memory. A more recent idea [15, 16] is to use a "shared" table of random words to
simulate hash functions that are fully random on each single chunk. The total space usage of this
table may be kept at $o(n)$.

If $m \ge 2n^{2/3}$ and $h\colon U \to [m]$ is chosen from a 4-universal class, then (as a standard calculation
shows) the largest chunk $S_i$ will have at most $\sqrt{n}$ elements with constant positive probability. We
repeat choosing such $h$'s until one is found that satisfies $\max\{|S_i| \mid 0 \le i < m\} \le \sqrt{n}$. This we fix
and call it $h^0$. The size of $S_i$ is called $n_i$.

It is a folklore fact that using $n_i^{1+\varepsilon}$ space it is very easy to provide a data structure that gives
fully random hash functions on $S_i$, which can be evaluated in constant time. A concrete construction
might look as follows.

Lemma 12 (e.g. [15]) Let $r = 2n^{3/4}$, and let $S' \subseteq U$ with $n' = |S'| \le \sqrt{n}$. Given a 1-universal
class of hash functions from $U$ to $[r]$, we may in expected time $O(|S'|)$ find two functions $h_0, h_1$
from that class such that if the two tables $T_0[0..r-1]$ and $T_1[0..r-1]$ are filled with random numbers
from $[t]$ then $h'(x) = (T_0[h_0(x)] + T_1[h_1(x)]) \bmod t$ defines a function that is fully random on $S'$.

This simple observation may be used as follows. For each of the $m$ chunks $S_i$ we find and fix
and store two functions $h_{i,0}, h_{i,1}$ (constant description size) as in Lemma 12. Further, we provide
a suitable number $L$ of pairs of tables $T_{j,0}$ and $T_{j,1}$, $1 \le j \le L$, independently filled with random
numbers from $[t]$ (we may even use varying ranges $[t_j]$). Then

$$h'_{i,j}(x) = (T_{j,0}[h_{i,0}(x)] + T_{j,1}[h_{i,1}(x)]) \bmod t, \quad \text{for } 1 \le j \le L,$$

provides $L$ fully random hash functions on $S_i$. The total space taken up by the tables is no more
than $2n^{3/4} \cdot L$ numbers, so any $L = O(n^{1/4-\eta}/\log t)$ will lead to a space usage of $O(n^{1-\eta})$ bits.
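
The evaluation of the simulated hash functions is simple enough to show in code. The sketch below covers only the evaluation side: the search for a good pair $h_{i,0}, h_{i,1}$ required by Lemma 12 is not shown, and the tiny tables, the two index functions and all names are hypothetical stand-ins.

```python
def simulated_hash(x, h0, h1, T0, T1, t):
    """One simulated hash value: (T0[h0(x)] + T1[h1(x)]) mod t."""
    return (T0[h0(x)] + T1[h1(x)]) % t

def make_chunk_functions(tables, h0, h1, t):
    """One function per shared table pair, all using the chunk's own h0 and h1."""
    return [
        (lambda x, T0=T0, T1=T1: simulated_hash(x, h0, h1, T0, T1, t))
        for T0, T1 in tables
    ]

# Hypothetical toy setting: r = 3, t = 7, L = 2 shared table pairs.
tables = [([1, 2, 3], [4, 5, 6]), ([0, 6, 1], [2, 2, 5])]
h0 = lambda x: x % 3                     # stand-in for h_{i,0}
h1 = lambda x: (x // 3) % 3              # stand-in for h_{i,1}
funcs = make_chunk_functions(tables, h0, h1, 7)
values = [f(5) for f in funcs]
```

Every chunk reuses the same `tables`; only the pair `h0`, `h1` is stored per chunk, which is why the per-chunk description stays constant in size.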

The central observation is that this setup makes it possible to work with L truly fully random 
hash functions on Si to construct a data structure Di for solving the respective problem (retrieval, 
approximate membership) for the keys in Si. As long as we keep the data structures disjoint, the 
fact that the tables are shared is harmless. More details may be found in [15].

Remark 9 In Remark 8 we mentioned a subtle problem in the false positive probability in the
approximate membership problem in case we use the split-and-share approach to simulate the hash
functions. The problem is caused by the fact that a query point $y \in U - S$ is not in $S_i$ for $i = h(y)$
and hence there is no guarantee that $q$ (simulated by $h_{i,0}, h_{i,1}$ and some pair $T_{j,0}$ and $T_{j,1}$) is fully
random on $S_i \cup \{y\}$. By the proof of Lemma 12 as given in [15] it can be seen that the probability
that this happens for any fixed $y$ is $O(1/\sqrt{n})$. This term has to be added to the false positive
probability.

C Space lower bound for approximate membership 

For completeness we prove a lower bound for the space needed for an approximate membership 
data structure. The proof is a slight extension of the lower bound of Carter et al. [9]. 

Theorem 13 (Carter et al. [9]) Let $u = |U|$ and consider an approximate membership data
structure for sets of size $n > 1$ and false positive probability $\varepsilon$, $0 < \varepsilon < 1$. Then the space usage in
bits must be at least $n \log_2(1/\varepsilon) - O\!\left(\frac{(1-\varepsilon)n^2}{\varepsilon u + (1-\varepsilon)n}\right)$. Specifically, for $u > n^2/\varepsilon$ the space usage must
be at least $n \log_2(1/\varepsilon) - O(1)$ bits.
Proof: Any instance $I$ of the data structure corresponds to a subset $U_I \subseteq U$, namely the set
of elements $x$ for which the data structure answers "$x \in S$". For any set $S \subseteq U$ there must
be some instance $I$ for which $S \subseteq U_I$. Furthermore, there must exist such an instance where
$|U_I| \le \varepsilon(u - n) + n$. This is because the expected number of false positives in $U - S$ is at most
$\varepsilon(u - n)$ (when choosing the data structure for $S$). We say that the instance covers $S$ if these two
conditions hold. The number of sets that can be covered by an instance $I$ is $\binom{|U_I|}{n} \le \binom{\lfloor \varepsilon(u-n) \rfloor + n}{n}$.
This means that the number of instances needed to cover all subsets of $U$ of size $n$ is bounded from
below by:

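$$
\frac{\binom{u}{n}}{\binom{\lfloor \varepsilon(u-n) \rfloor + n}{n}}
\;\ge\; \frac{u(u-1)\cdots(u-n+1)}{(\varepsilon(u-n)+n)(\varepsilon(u-n)+n-1)\cdots(\varepsilon(u-n)+1)}
\;\ge\; \left(\frac{u}{\varepsilon(u-n)+n}\right)^{n}
$$
$$
=\; \left(\frac{1}{\varepsilon}\right)^{n}\left(1 - \frac{(1-\varepsilon)n}{\varepsilon u + (1-\varepsilon)n}\right)^{n}
\;\ge\; \left(\frac{1}{\varepsilon}\right)^{n}\exp\!\left(-O\!\left(\frac{(1-\varepsilon)n^{2}}{\varepsilon u + (1-\varepsilon)n}\right)\right).
$$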

Since each instance has a unique memory representation, this means that the number of bits used 
by the data structure in the worst case must be at least log 2 of the number of instances. □ 
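
The counting argument can be sanity-checked numerically for small parameters. The sketch below compares the exact value of $\log_2$ of the ratio of binomial coefficients with the closed form $n \log_2\frac{u}{\varepsilon(u-n)+n}$ that appears after the second inequality, and with the leading term $n \log_2(1/\varepsilon)$. The function names and the sample parameters are chosen here for illustration only.

```python
from math import comb, floor, log2

def exact_log2_instances(u, n, eps):
    """log2 of C(u, n) / C(floor(eps*(u-n)) + n, n)."""
    cover = floor(eps * (u - n)) + n
    return log2(comb(u, n)) - log2(comb(cover, n))

def closed_form_log2(u, n, eps):
    """n * log2(u / (eps*(u-n) + n))."""
    return n * log2(u / (eps * (u - n) + n))

u, n, eps = 1000, 10, 0.25               # hypothetical sample parameters
exact = exact_log2_instances(u, n, eps)
closed = closed_form_log2(u, n, eps)
leading = n * log2(1 / eps)
```

For these parameters the closed form is about 19.6 bits against a leading term of 20 bits, and the exact count is at least as large as the closed form, as the chain of inequalities predicts.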